Abstract

Privacy-preserving data mining is an important issue in the areas
of data mining and security. In this paper, we study how to conduct
association rule mining, one of the core data mining techniques, on
private data in the following scenario: multiple parties, each
having a private data set, want to jointly conduct association rule
mining without disclosing their private data to other parties.
Because of the interactive nature among parties, developing a secure
framework to achieve such a computation is both challenging and
desirable. In this paper, we present a secure framework for multiple
parties to conduct privacy-preserving association rule mining.
1  INTRODUCTION

Business successes are no longer the result of an individual toiling
in isolation; rather, successes depend upon collaboration, team
efforts, and partnership. In the modern business world, collaboration
becomes especially important because of the mutual benefit it brings.
Sometimes, such a collaboration even occurs among competitors, or
among companies that have conflicts of interest, but the collaborators
are aware that the benefit brought by such a collaboration will give
them an advantage over other competitors. For this kind of
collaboration, data privacy becomes extremely important: all the
parties of the collaboration promise to provide their private data to
the collaboration, but none of them wants the others or any third
party to learn much about their private data.

This paper studies a very specific collaboration that is becoming
more and more prevalent in the business world: collaborative data
mining. Data mining is a technology that has emerged as a means for
identifying patterns and trends in a large quantity of data. The goal
of our studies is to develop technologies that enable multiple parties
to conduct data mining collaboratively without disclosing their
private data.

In recent times, the explosion in the availability of various kinds
of data has triggered tremendous opportunities for collaboration, in
particular collaboration in data mining. The following are some
realistic scenarios:

1.  Multiple competing supermarkets, each having a very large set
of data records of its customers' buying behaviors, want to
conduct data mining on their joint data set for mutual benefit.
Since these companies are competitors in the market, they do not
want to disclose too much about their customers' information to
each other, but they know the results obtained from this
collaboration could bring them an advantage over other
competitors.

2.  Several pharmaceutical companies have each invested a
significant amount of money in conducting experiments related to
human genes, with the goal of discovering meaningful patterns
among the genes. To reduce the cost, the companies decide to join
forces, but none of them wants to disclose too much information
about its raw data because they are only interested in this
collaboration; by disclosing the raw data, a company essentially
enables other parties to make discoveries that the company does
not want to share with others.

To use the existing data mining algorithms, all parties need to send
their data to a trusted central place (such as a super-computing
center) to conduct the mining. However, in situations with privacy
concerns, the parties may not trust anyone. We call this type of
problem the Privacy-preserving Collaborative Data Mining (PCDM)
problem. For each data mining problem, there is a corresponding PCDM
problem. Fig. 1 shows how a traditional data mining problem could be
transformed to a PCDM problem. This paper focuses only on
heterogeneous collaboration (Fig. 1.c). Heterogeneous collaboration
means that each party has a different set of attributes; homogeneous
collaboration means that each party has the same set of attributes.

Figure 1.  Privacy Preserving Non-collaborative and Collaborative
Data Mining

Generic solutions for any kind of secure collaborative computing
exist in the literature [5]. These solutions are the results of the
studies of the Secure Multi-party Computation problem [10, 5], which
is a more general form of secure collaborative computing. However,
none of the proposed generic solutions is practical; they are not
scalable and cannot handle large-scale data sets because of the
prohibitive extra cost of protecting data privacy. Therefore,
practical solutions need to be developed. This need underlies the
rationale for our research.

Data mining includes a number of different tasks, such as association
rule mining, classification, and clustering. This paper studies the
association rule mining problem. The goal of association rule mining
is to discover meaningful association rules among the attributes of a
large quantity of data. For example, let us consider the database of
a medical study, with each attribute representing a symptom found in
a patient. A discovered association rule pattern could be "70% of
patients who are drug injection takers also have hepatitis". This
information can be useful for disease control, medical research, etc.
Based on the existing association rule mining technologies, we study
the Mining Association Rules On Private Data (MAP) problem, defined
as follows: multiple parties want to conduct association rule mining
on a data set that consists of all the parties' private data, and no
party is willing to disclose its raw data to the other parties.

The existing research on association rule mining [8] provides the
basis for collaborative association rule mining. However, none of
those methods satisfies the security requirements of MAP or can be
trivially modified to satisfy them. With the increasing need for
privacy-preserving data mining, more and more people are interested
in finding solutions to the MAP problem. Vaidya and Clifton proposed
a solution [9] for two parties to conduct privacy-preserving
association rule mining. However, for the general case where more
than two parties are involved, the MAP problem presents a much
greater challenge.

The paper is organized as follows: The related work is discussed in
Section 2. We describe the association rule mining procedure in
Section 3. We then formally define our proposed secure protocol in
Section 4. In Section 5, we conduct security and communication
analysis. We give our conclusion in Section 6.

2  RELATED WORK

Secure  Multi-party  Computation 

Briefly, a Secure Multi-party Computation (SMC) problem deals with
computing any function on any input, in a distributed network where
each participant holds one of the inputs, while ensuring that no more
information is revealed to a participant in the computation than can
be inferred from that participant's input and output. The SMC problem
was introduced by Yao [10]. It has been proved that for any function,
there is a secure multi-party computation solution [5]. The approach
used is as follows: the function F to be computed is first represented
as a combinatorial circuit, and then the parties run a short protocol
for every gate in the circuit. Every participant gets corresponding
shares of the input wires and the output wires for every gate. This
approach, although appealing in its generality and simplicity, is
highly impractical.

Privacy-preserving Data Mining

In early work on the privacy-preserving data mining problem, Lindell
and Pinkas [7] proposed a solution to the privacy-preserving
classification problem using the oblivious transfer protocol, a
powerful tool developed through the secure multi-party computation
studies. Another approach for solving the privacy-preserving
classification problem was proposed by Agrawal and Srikant [1]. In
their approach, each individual data item is perturbed and the
distribution of all the data is reconstructed at an aggregate level.
The technique works for those data mining algorithms that use the
probability distributions rather than individual records. In [9], a
solution to the association mining problem for the case of two parties
was proposed. In [3], a procedure is provided to build a classifier on
private data, where a semi-trusted party was employed to improve the
performance of communication and computation. In this paper, we also
adopt the model of the semi-trusted party because of the effectiveness
and usefulness it brings, and we present a secure protocol allowing
the computation to be carried out by the parties.

3  MINING  ASSOCIATION  RULES 
ON  PRIVATE  DATA 

Since its introduction in 1993 [8], association rule mining has
received a great deal of attention. It is still one of the most
popular pattern-discovery methods in the field of knowledge discovery.
Briefly, an association rule is an expression $X \Rightarrow Y$, where
$X$ and $Y$ are sets of items. The meaning of such rules is as
follows: given a database $D$ of records, $X \Rightarrow Y$ means that
whenever a record $R$ contains $X$ then $R$ also contains $Y$ with a
certain confidence. The rule confidence is defined as the percentage
of records containing both $X$ and $Y$ with regard to the overall
number of records containing $X$. The fraction of records $R$
supporting an item $X$ with respect to database $D$ is called the
support of $X$.
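
As a small illustration of these two definitions, the following Python sketch computes support and confidence on a plain (non-private) list of records. The item names and the helper functions `support` and `confidence` are hypothetical and are not part of the paper; records are modeled as sets of items.

```python
def support(records, itemset):
    """Fraction of records that contain every item in `itemset`."""
    hits = sum(1 for r in records if itemset <= r)
    return hits / len(records)


def confidence(records, x, y):
    """Among the records containing X, the fraction that also contain Y."""
    with_x = [r for r in records if x <= r]
    if not with_x:
        return 0.0
    return sum(1 for r in with_x if y <= r) / len(with_x)


# Hypothetical toy database of four records.
records = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
x, y = frozenset({"bread"}), frozenset({"milk"})

rule_support = support(records, x | y)        # records with both X and Y
rule_confidence = confidence(records, x, y)   # P(Y | X)
```

Here two of the four records contain both items, and two of the three records containing "bread" also contain "milk".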

3.1  Problem  Definition 

We consider the scenario where multiple parties, each having a
private data set (denoted by $D_1$, $D_2$, $\cdots$, and $D_n$
respectively), want to collaboratively conduct association rule mining
on the union of their data sets. Because they are concerned about
their data's privacy, no party is willing to disclose its raw data set
to others. Without loss of generality, we make the following
assumptions on the data sets. The assumptions can be achieved by
pre-processing the data sets $D_1$, $D_2$, $\cdots$, and $D_n$, and
such pre-processing does not require one party to send its data set to
other parties. (In this paper, we consider applications where the
identifier of each data record is recorded. In contrast, for
transactions such as supermarket purchases, customers' IDs may not be
needed. The IDs and the names of attributes are known to all parties
during the joint computation. A data record used in the joint
association rule mining has the same ID in different databases.)

1.  $D_1$, $D_2$, $\cdots$ and $D_n$ are binary data sets, namely
they only contain 0's and 1's, where $n$ is the total number of
parties. (Our method is applicable to attributes that are of
non-binary value. An attribute of non-binary value will be
converted to a binary representation. Detailed implementation
includes discretizing and categorizing attributes that are of
continuous or ordinal values.)

2.  $D_1$, $D_2$, $\cdots$ and $D_n$ contain the same number of
records. Let $N$ denote the total number of records for each data
set.

3.  The identities of the $i$th (for $i \in [1, N]$) record in
$D_1$, $D_2$, $\cdots$ and $D_n$ are the same.

Mining Association Rules On Private Data problem: Party 1 has a
private data set $D_1$, party 2 has a private data set $D_2$,
$\cdots$ and party $n$ has a private data set $D_n$. The data set
$[D_1 \cup D_2 \cup \cdots \cup D_n]$ is the union of $D_1$, $D_2$,
$\cdots$ and $D_n$ (by vertically putting $D_1$, $D_2$, $\cdots$ and
$D_n$ together so that the concatenation of the $i$th row in $D_1$,
$D_2$, $\cdots$ and $D_n$ becomes the $i$th row in
$[D_1 \cup D_2 \cup \cdots \cup D_n]$). The $n$ parties want to
conduct association rule mining on
$[D_1 \cup D_2 \cup \cdots \cup D_n]$ and to find the association
rules with support and confidence greater than the given thresholds.
We say an association rule (e.g., $x_i \Rightarrow y_j$) has
confidence $c\%$ in the data set
$[D_1 \cup D_2 \cup \cdots \cup D_n]$ if in
$[D_1 \cup D_2 \cup \cdots \cup D_n]$ $c\%$ of the records which
contain $x_i$ also contain $y_j$ (namely, $c\% = P(y_j \mid x_i)$).
We say that the association rule has support $s\%$ in
$[D_1 \cup D_2 \cup \cdots \cup D_n]$ if $s\%$ of the records in
$[D_1 \cup D_2 \cup \cdots \cup D_n]$ contain both $x_i$ and $y_j$
(namely, $s\% = P(x_i \cap y_j)$).
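
To make the vertically joined data set concrete, the following Python sketch builds the union of three hypothetical binary column sets in the clear and evaluates the support and confidence of one rule on it. This plaintext join is exactly what the parties want to avoid; it is shown only as a reference for what the secure computation must reproduce. The attribute names `x1`, `y1`, `z1`, the data values, and the helpers `vertical_union` and `rule_stats` are assumptions made for illustration.

```python
# Each party holds different attributes (columns) for the same N = 4 records.
d1 = {"x1": [1, 1, 0, 1]}  # party 1
d2 = {"y1": [1, 0, 0, 1]}  # party 2
d3 = {"z1": [1, 1, 1, 0]}  # party 3


def vertical_union(*datasets):
    """Concatenate the i-th rows of all data sets into one joined row."""
    columns = {}
    for d in datasets:
        columns.update(d)
    n = len(next(iter(columns.values())))
    if any(len(col) != n for col in columns.values()):
        raise ValueError("all data sets must contain the same number of records")
    return [{name: col[i] for name, col in columns.items()} for i in range(n)]


def rule_stats(rows, x, y):
    """Return (support, confidence) of the rule x => y on binary rows."""
    n_x = sum(1 for r in rows if r[x] == 1)
    n_xy = sum(1 for r in rows if r[x] == 1 and r[y] == 1)
    support = n_xy / len(rows)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence


rows = vertical_union(d1, d2, d3)
stats = rule_stats(rows, "x1", "y1")
```

In this toy data two of four records have both `x1` and `y1` set, and three records have `x1` set.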

3.2  Association  Rule  Mining  Procedure 

The following is the procedure for mining association rules on
$[D_1 \cup D_2 \cup \cdots \cup D_n]$.

1.  $L_1$ = large 1-itemsets

2.  for ($k = 2$; $L_{k-1} \neq \emptyset$; $k$++) do begin

3.      $C_k$ = apriori-gen($L_{k-1}$)

4.      for all candidates $c \in C_k$ do begin

5.          Compute c.count  \\ We will show how to compute it in
            Section 3.3

6.      end

7.      $L_k = \{c \in C_k \mid c.count \geq$ min-sup$\}$

8.  end

9.  Return $L = \cup_k L_k$

The procedure apriori-gen is described in the following (please also
see [6] for details).

apriori-gen($L_{k-1}$: large ($k-1$)-itemsets)

1.  for each itemset $l_1 \in L_{k-1}$ do begin

2.      for each itemset $l_2 \in L_{k-1}$ do begin

3.          if (($l_1[1] = l_2[1]$) $\wedge$ ($l_1[2] = l_2[2]$) $\wedge \cdots \wedge$
            ($l_1[k-2] = l_2[k-2]$) $\wedge$ ($l_1[k-1] < l_2[k-1]$)) {

4.              then $c = l_1$ join $l_2$

5.              for each ($k-1$)-subset $s$ of $c$ do begin

6.                  if $s \notin L_{k-1}$

7.                      then delete $c$

8.                      else add $c$ to $C_k$

9.              end

10.         }

11.     end

12. end

13. return $C_k$
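
The two procedures above can be sketched in Python as follows. This is a minimal, non-private version under stated assumptions: itemsets are represented as sorted tuples, the join condition is the standard Apriori one (equal on the first $k-2$ items, strictly smaller last item), the threshold test uses `>=`, and `count` is a pluggable callback so that a secure counting routine could be substituted. None of these names come from the paper.

```python
from itertools import combinations


def apriori_gen(prev_large):
    """Generate candidate k-itemsets from the set of large (k-1)-itemsets."""
    candidates = set()
    for l1 in prev_large:
        for l2 in prev_large:
            # Join step: same first k-2 items, l1's last item is smaller.
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # Prune step: every (k-1)-subset of c must itself be large.
                if all(s in prev_large for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


def apriori(records, min_sup, count=None):
    """Return all itemsets whose count reaches min_sup (an absolute count)."""
    def plain_count(c):
        return sum(1 for r in records if set(c) <= r)

    count = count or plain_count
    items = sorted({item for r in records for item in r})
    large = {(item,) for item in items if count((item,)) >= min_sup}
    result = set(large)
    while large:
        large = {c for c in apriori_gen(large) if count(c) >= min_sup}
        result |= large
    return result


# Hypothetical toy input.
records = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
large_itemsets = apriori(records, min_sup=3)
```

With this input every single item and every pair appears in at least three records, while the triple appears in only two, so the triple is generated as a candidate but not kept.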

3.3  How to compute c.count

If all the candidates belong to the same party, then c.count, which
refers to the frequency counts for candidates, can be computed by this
party. If the candidates belong to different parties, they then
construct vectors for their own attributes and apply our number
product protocol, which will be discussed in Section 4, to obtain
c.count. We use an example to illustrate how to compute c.count among
three parties. Party 1, party 2 and party 3 construct vectors $X$,
$Y$ and $Z$ for their own attributes respectively. To obtain c.count,
they need to compute $\sum_{i=1}^{N} X[i] \cdot Y[i] \cdot Z[i]$,
where $N$ is the total number of values in each vector. For instance,
if the vectors are as depicted in Fig. 2, then
$\sum_{i=1}^{N} X[i] \cdot Y[i] \cdot Z[i] = 3$. We provide an
efficient protocol in Section 4 for the parties to compute this value
without revealing their private data to each other.
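
The quantity being computed is simply the number of records in which all parties' attributes equal 1. A short Python sketch of this count in the clear is given below; the vectors are hypothetical (they are not the contents of Fig. 2) and the function name `c_count` is an assumption.

```python
def c_count(*vectors):
    """Sum over i of the product of the i-th entries of all binary vectors."""
    n = len(vectors[0])
    if any(len(v) != n for v in vectors):
        raise ValueError("all vectors must have the same length N")
    total = 0
    for i in range(n):
        product = 1
        for v in vectors:
            product *= v[i]
        total += product
    return total


# Hypothetical attribute vectors held by party 1, party 2 and party 3.
X = [1, 0, 1, 1, 1]
Y = [1, 1, 1, 0, 1]
Z = [1, 1, 1, 1, 1]

count_xyz = c_count(X, Y, Z)  # records 1, 3 and 5 have all three bits set
```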
Figure 2.  Raw Data For Alice, Bob and Carol


4  BUILDING  BLOCK 

How two or more parties jointly compute c.count without revealing
their raw data to each other is the challenge that we want to address.
The number product protocol described in this section is the main
technical tool used to compute it. We will describe an efficient
solution of the number product protocol based on a commodity server, a
semi-trusted party.

Our building blocks are two protocols. The first protocol is for two
parties to conduct the multiplication operation. This protocol differs
from [3] in that we consider the product of numbers instead of
vectors. Since the vector product computation can only be applied to
two vectors, it cannot deal with the computation involved in multiple
parties, where more than two vectors may participate in the
computation. The second protocol, with the first protocol as the
basis, is designed for the secure multi-party product operation.

4.1  Introducing  The  Commodity  Server 

For performance reasons, we use an extra server, the commodity server
[2], in our protocol. The parties can send requests to the commodity
server and receive data (called commodities) from the server, but the
commodities must be independent of the parties' private data. The
purpose of the commodities is to help the parties conduct the desired
computations.

The commodity server is semi-trusted in the following senses: (1) It
should not be trusted; therefore, it should not be able to derive the
parties' private information, and it should not learn the computation
result either. (2) It should not collude with all the parties. (3) It
follows the protocol correctly. Because of these characteristics, we
say that it is a semi-trusted party. In the real world, finding such a
semi-trusted party is much easier than finding a trusted party.

As we will see from our solutions, the commodity server does not
participate in the actual computation among the parties; it only
supplies commodities that are independent of the parties' private
data. Therefore, the server can generate independent data off-line
beforehand, and sell them as commodities to the parties (hence the
name "commodity server").

4.2  Secure  Number  Product  Protocol 

Let's first consider the case of two parties where $n = 2$ (more
general cases where $n \geq 3$ will be discussed later). Alice has a
vector $X$ and Bob has a vector $Y$. Both vectors have $N$ elements.
Alice and Bob want to compute the product between $X$ and $Y$ such
that Alice gets $\sum_{i=1}^{N} U_x[i]$ and Bob gets
$\sum_{i=1}^{N} U_y[i]$, where

$$\sum_{i=1}^{N} U_x[i] + \sum_{i=1}^{N} U_y[i] = \sum_{i=1}^{N} X[i] \cdot Y[i] = X \cdot Y.$$

$U_y[i]$ and $U_x[i]$ are random numbers. Namely, the scalar product
of $X$ and $Y$ is divided into two secret pieces, with one piece going
to Alice and the other going to Bob. We assume that random numbers are
generated from the integer domain.

4.2.1  Secure Two-party Product Protocol

Protocol 1.  (Secure Two-party Product Protocol)

1.  The Commodity Server generates two random numbers $R_x[1]$ and
$R_y[1]$, and lets $r_x[1] + r_y[1] = R_x[1] \cdot R_y[1]$, where
$r_x[1]$ (or $r_y[1]$) is a randomly generated number. Then the
server sends $(R_x[1], r_x[1])$ to Alice, and $(R_y[1], r_y[1])$
to Bob.

2.  Alice sends $\hat{X}[1] = X[1] + R_x[1]$ to Bob.

3.  Bob sends $\hat{Y}[1] = Y[1] + R_y[1]$ to Alice.

4.  Bob generates a random number $U_y[1]$, and computes
$\hat{X}[1] \cdot Y[1] + (r_y[1] - U_y[1])$, then sends the result
to Alice.

5.  Alice computes
$(\hat{X}[1] \cdot Y[1] + (r_y[1] - U_y[1])) - (R_x[1] \cdot \hat{Y}[1]) + r_x[1]$
$= X[1] \cdot Y[1] - U_y[1] + (r_y[1] - R_x[1] \cdot R_y[1] + r_x[1])$
$= X[1] \cdot Y[1] - U_y[1] = U_x[1]$.

6.  Repeat steps 1-5 to compute $X[i] \cdot Y[i]$ for
$i \in [2, N]$. Alice then gets $\sum_{i=1}^{N} U_x[i]$ and Bob
gets $\sum_{i=1}^{N} U_y[i]$.
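
The message flow of Protocol 1 can be simulated in a single process as sketched below. This is only an illustration of the arithmetic, not a networked or hardened implementation: all three roles run in one function, the random domain `DOMAIN` is an arbitrary assumption standing in for "the integer domain", and the function names are hypothetical.

```python
import secrets

DOMAIN = 2 ** 64  # assumed size of the random-number domain


def rand():
    return secrets.randbelow(DOMAIN)


def commodity_server():
    """Step 1: produce (R_x, r_x) for Alice and (R_y, r_y) for Bob."""
    R_x, R_y, r_x = rand(), rand(), rand()
    r_y = R_x * R_y - r_x  # so that r_x + r_y == R_x * R_y
    return (R_x, r_x), (R_y, r_y)


def two_party_product(x, y):
    """Return shares (U_x, U_y) with U_x + U_y == x * y."""
    (R_x, r_x), (R_y, r_y) = commodity_server()
    x_hat = x + R_x                     # step 2: Alice -> Bob
    y_hat = y + R_y                     # step 3: Bob -> Alice
    U_y = rand()                        # step 4: Bob keeps U_y ...
    message = x_hat * y + (r_y - U_y)   # ... and sends this to Alice
    U_x = message - R_x * y_hat + r_x   # step 5: Alice's share
    return U_x, U_y


def scalar_product_shares(X, Y):
    """Step 6: repeat per element and sum each party's shares."""
    shares = [two_party_product(x, y) for x, y in zip(X, Y)]
    return sum(u for u, _ in shares), sum(u for _, u in shares)


alice_share, bob_share = scalar_product_shares([1, 0, 1, 1], [1, 1, 0, 1])
```

Neither share alone carries the result; only their sum equals the scalar product of the two input vectors.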

The bit-wise communication cost of this protocol is $7 * M * N$,
where $M$ is the maximum number of bits for the values involved in our
protocol. The cost is approximately 7 times the optimal cost of a
two-party scalar product (the optimal cost of a scalar product is
defined as the cost of conducting the product of $X$ and $Y$ without
the privacy constraints, namely one party simply sends its data in
plain to the other party). The cost can be decreased to $3 * M * N$ if
the commodity server just sends seeds to Alice and Bob, since the
seeds can be used to generate a set of random numbers.
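
The factor 7 can be read off the steps of Protocol 1: per vector element the server sends four values (two to Alice, two to Bob) and the parties exchange three values (steps 2, 3 and 4). The following Python sketch just spells out this arithmetic; the function name is hypothetical, and the cost of transmitting the seeds themselves is neglected, as in the text.

```python
def protocol1_cost_bits(n_elements, bits_per_value, server_sends_seeds=False):
    """Bit-wise communication cost of Protocol 1 for vectors of length N."""
    # (R_x, r_x) to Alice and (R_y, r_y) to Bob: 4 values per element,
    # or none if the parties derive them locally from a seed.
    server_values = 0 if server_sends_seeds else 4
    # X + R_x, Y + R_y and Bob's masked product: 3 values per element.
    party_values = 3
    return (server_values + party_values) * bits_per_value * n_elements


full_cost = protocol1_cost_bits(n_elements=1000, bits_per_value=32)
seeded_cost = protocol1_cost_bits(1000, 32, server_sends_seeds=True)
```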

4.2.2  Secure  Multi-party  Product  Protocol 

We have discussed our protocol of secure number product for two
parties. Next, we will consider the protocol for securely computing
the number product for multiple parties. For simplicity, we only
describe the protocol when $n = 3$. The protocols for the cases when
$n > 3$ can be similarly derived. Our concern is that a similarly
derived solution may not be efficient when the number of parties is
large; a more efficient solution is still under research. Without loss
of generality, let Alice have a private vector $X$ and a randomly
generated vector $R_x$, Bob have a private vector $Y$ and a randomly
generated vector $R_y$ with $R_y[i] = R'_y[i] + R''_y[i]$ for
$i \in [1, N]$, and Carol have a private vector $Z$ and a randomly
generated vector $R_z$. First, we let the parties hide these private
numbers by using their respective random numbers, then conduct the
product for the multiple numbers.

$$
\begin{aligned}
T[1] &= (X[1] + R_x[1]) * (Y[1] + R_y[1]) * (Z[1] + R_z[1]) \\
&= X[1]Y[1]Z[1] + X[1]R_y[1]Z[1] + R_x[1]Y[1]Z[1] + R_x[1]R_y[1]Z[1] \\
&\quad + X[1]Y[1]R_z[1] + X[1]R_y[1]R_z[1] + R_x[1]Y[1]R_z[1] + R_x[1]R_y[1]R_z[1] \\
&= X[1]Y[1]Z[1] + X[1]R_y[1]Z[1] + R_x[1]Y[1]Z[1] + R_x[1]R_y[1]Z[1] \\
&\quad + X[1]Y[1]R_z[1] + X[1]R_y[1]R_z[1] + R_x[1]Y[1]R_z[1] + R_x[1](R'_y[1] + R''_y[1])R_z[1] \\
&= X[1]Y[1]Z[1] + X[1]R_y[1]Z[1] + R_x[1]Y[1]Z[1] + R_x[1]R_y[1]Z[1] \\
&\quad + X[1]Y[1]R_z[1] + X[1]R_y[1]R_z[1] + R_x[1]Y[1]R_z[1] + R_x[1]R'_y[1]R_z[1] + R_x[1]R''_y[1]R_z[1] \\
&= X[1]Y[1]Z[1] + (X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R_y[1])Z[1] \\
&\quad + (X[1]Y[1] + X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R'_y[1])R_z[1] + R_x[1]R''_y[1]R_z[1] \\
&= T_0[1] + T_1[1] + T_2[1] + T_3[1],
\end{aligned}
$$

where

$T_0[1] = X[1]Y[1]Z[1]$,

$T_1[1] = (X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R_y[1])Z[1]$,

$T_2[1] = (X[1]Y[1] + X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R'_y[1])R_z[1]$,

$T_3[1] = R_x[1]R''_y[1]R_z[1]$.

$T_0[1]$ is what we want to obtain. To compute $T_0[1]$, we need to
know $T[1]$, $T_1[1]$, $T_2[1]$ and $T_3[1]$. In this protocol, we let
Alice get $T[1]$, Bob get $T_3[1]$ and Carol get $T_1[1]$ and
$T_2[1]$. Bob separates $R_y[1]$ into $R'_y[1]$ and $R''_y[1]$. If he
fails to do so, then his data might be disclosed during the
computation of these terms.

To compute $T_1[1]$ and $T_2[1]$, Alice and Bob can use Protocol 1 to
compute $X[1]R_y[1]$, $R_x[1]Y[1]$, $R_x[1]R_y[1]$, $X[1]Y[1]$ and
$R_x[1]R'_y[1]$. Thus, according to Protocol 1, Alice gets $U_x[1]$,
$U_x[2]$, $U_x[3]$, $U_x[4]$ and $U_x[5]$ and Bob gets $U_y[1]$,
$U_y[2]$, $U_y[3]$, $U_y[4]$ and $U_y[5]$. Then they compute
$(X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R_y[1])$ and
$(X[1]Y[1] + X[1]R_y[1] + R_x[1]Y[1] + R_x[1]R'_y[1])$ and send the
results to Carol, who can then compute $T_1[1]$ and $T_2[1]$. Note
that $X[1]R_y[1] = U_x[1] + U_y[1]$, $R_x[1]Y[1] = U_x[2] + U_y[2]$,
$R_x[1]R_y[1] = U_x[3] + U_y[3]$, $X[1]Y[1] = U_x[4] + U_y[4]$ and
$R_x[1]R'_y[1] = U_x[5] + U_y[5]$.

To compute $T_3[1]$, Alice and Carol use Protocol 1 to compute
$R_x[1]R_z[1]$, then send the results to Bob, who can then compute
$T_3[1]$.

To compute $T[1]$, Bob sends $Y[1] + R_y[1]$ to Alice and Carol sends
$Z[1] + R_z[1]$ to Alice.

Repeat the above process to compute $T[i]$, $T_1[i]$, $T_2[i]$,
$T_3[i]$ and $T_0[i]$ for $i \in [2, N]$. Then, Alice has
$\sum_{i=1}^{N} T[i]$, Bob has $\sum_{i=1}^{N} T_3[i]$ and Carol has
$\sum_{i=1}^{N} T_1[i]$ and $\sum_{i=1}^{N} T_2[i]$. Finally, we
achieve the goal and obtain

$$\sum_{i=1}^{N} X[i]Y[i]Z[i] = \sum_{i=1}^{N} T_0[i] = \sum_{i=1}^{N} T[i] - \sum_{i=1}^{N} T_1[i] - \sum_{i=1}^{N} T_2[i] - \sum_{i=1}^{N} T_3[i].$$
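
The decomposition of the masked product into four terms is a purely algebraic identity, so it can be checked numerically. The Python sketch below does this for one element with arbitrary integers; the function name `decompose` and the variable names `r_y1`, `r_y2` (standing for the two parts into which Bob splits his random number) are assumptions for illustration.

```python
import random


def decompose(x, y, z, r_x, r_y1, r_y2, r_z):
    """Return (T, T0, T1, T2, T3) for one element, with R_y = r_y1 + r_y2."""
    r_y = r_y1 + r_y2
    t = (x + r_x) * (y + r_y) * (z + r_z)                   # Alice's term
    t0 = x * y * z                                           # wanted product
    t1 = (x * r_y + r_x * y + r_x * r_y) * z                 # Carol's first term
    t2 = (x * y + x * r_y + r_x * y + r_x * r_y1) * r_z      # Carol's second term
    t3 = r_x * r_y2 * r_z                                    # Bob's term
    return t, t0, t1, t2, t3


rng = random.Random(0)  # fixed seed only to make the check repeatable
checks = []
for _ in range(100):
    values = [rng.randint(-10**6, 10**6) for _ in range(7)]
    t, t0, t1, t2, t3 = decompose(*values)
    checks.append(t == t0 + t1 + t2 + t3)

identity_holds = all(checks)
```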

Protocol 2.  (Secure Multi-party Product Protocol)

Step I: Random Number Generation

1.  Alice generates a random number $R_x[1]$.

2.  Bob generates two random numbers $R'_y[1]$ and $R''_y[1]$.

3.  Carol generates a random number $R_z[1]$.

Step II: Collaborative Computing

Sub-step 1: To Compute $T[1]$

1.  Carol computes $Z[1] + R_z[1]$ and sends it to Alice.

2.  Bob computes $Y[1] + R_y[1]$ and sends it to Alice.

3.  Alice computes
$T[1] = (X[1] + R_x[1]) * (Y[1] + R_y[1]) * (Z[1] + R_z[1])$.

Sub-step 2: To Compute $T_1[1]$ and $T_2[1]$

1.  Alice and Bob use Protocol 1 to compute $X[1] \cdot R_y[1]$,
$R_x[1] \cdot Y[1]$, $R_x[1] \cdot R_y[1]$, $X[1] \cdot Y[1]$ and
$R_x[1] \cdot R'_y[1]$. Then Alice obtains $U_x[1]$, $U_x[2]$,
$U_x[3]$, $U_x[4]$ and $U_x[5]$ and Bob obtains $U_y[1]$, $U_y[2]$,
$U_y[3]$, $U_y[4]$ and $U_y[5]$.

2.  Alice sends $(U_x[1] + U_x[2] + U_x[3])$ and
$(U_x[4] + U_x[1] + U_x[2] + U_x[5])$ to Carol.

3.  Bob sends $(U_y[1] + U_y[2] + U_y[3])$ and
$(U_y[4] + U_y[1] + U_y[2] + U_y[5])$ to Carol.

4.  Carol computes
$T_1[1] = (X[1] \cdot R_y[1] + R_x[1] \cdot Y[1] + R_x[1] \cdot R_y[1]) \cdot Z[1]$,
and
$T_2[1] = (X[1] \cdot Y[1] + X[1] \cdot R_y[1] + R_x[1] \cdot Y[1] + R_x[1] \cdot R'_y[1]) \cdot R_z[1]$.

Sub-step 3: To Compute $T_3[1]$

1.  Alice and Carol use Protocol 1 to compute $R_x[1] \cdot R_z[1]$
and send the values they obtained from the protocol to Bob.

2.  Bob computes $T_3[1] = R''_y[1] \cdot R_x[1] \cdot R_z[1]$.

Step III: Repeating

1.  Repeat Step I and Step II to compute $T[i]$, $T_1[i]$, $T_2[i]$
and $T_3[i]$ for $i \in [2, N]$.

2.  Alice then gets
$[a] = \sum_{i=1}^{N} T[i] = \sum_{i=1}^{N} (X[i] + R_x[i]) * (Y[i] + R_y[i]) * (Z[i] + R_z[i])$.

3.  Bob gets
$[b] = \sum_{i=1}^{N} T_3[i] = \sum_{i=1}^{N} R_x[i] \cdot R''_y[i] \cdot R_z[i]$.

4.  Carol gets
$[c] = \sum_{i=1}^{N} T_1[i] = \sum_{i=1}^{N} (X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R_y[i]) \cdot Z[i]$,
and
$[d] = \sum_{i=1}^{N} T_2[i] = \sum_{i=1}^{N} (X[i] \cdot Y[i] + X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R'_y[i]) \cdot R_z[i]$.

Note that
$\sum_{i=1}^{N} X[i] \cdot Y[i] \cdot Z[i] = \sum_{i=1}^{N} T_0[i] = [a] - [b] - [c] - [d]$.
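
A single-process simulation of Protocol 2 for three parties is sketched below. It is an illustration of the arithmetic only: all parties run inside one function, the two-party sub-protocol is a compact restatement of Protocol 1 with the commodity server folded in, the random domain is an arbitrary assumption, and every function name is hypothetical.

```python
import secrets


def rand(bound=2 ** 64):
    return secrets.randbelow(bound)  # assumed random-number domain


def shares_of_product(a, b):
    """Protocol 1 for one pair: returns (u_a, u_b) with u_a + u_b == a * b."""
    R_a, R_b, r_a = rand(), rand(), rand()
    r_b = R_a * R_b - r_a            # commodities: r_a + r_b == R_a * R_b
    a_hat, b_hat = a + R_a, b + R_b  # masked inputs that are exchanged
    u_b = rand()
    message = a_hat * b + (r_b - u_b)
    u_a = message - R_a * b_hat + r_a
    return u_a, u_b


def three_party_terms(x, y, z):
    """Return (T, T1, T2, T3) for one element, following Steps I and II."""
    r_x, r_y1, r_y2, r_z = rand(), rand(), rand(), rand()  # Step I
    r_y = r_y1 + r_y2
    t = (x + r_x) * (y + r_y) * (z + r_z)                   # Sub-step 1 (Alice)
    # Sub-step 2: five runs of Protocol 1 between Alice and Bob.
    pairs = [(x, r_y), (r_x, y), (r_x, r_y), (x, y), (r_x, r_y1)]
    ux, uy = zip(*(shares_of_product(a, b) for a, b in pairs))
    s1 = (ux[0] + ux[1] + ux[2]) + (uy[0] + uy[1] + uy[2])
    s2 = (ux[3] + ux[0] + ux[1] + ux[4]) + (uy[3] + uy[0] + uy[1] + uy[4])
    t1, t2 = s1 * z, s2 * r_z                               # Carol
    # Sub-step 3: Alice and Carol share R_x * R_z and send the shares to Bob.
    u_alice, u_carol = shares_of_product(r_x, r_z)
    t3 = r_y2 * (u_alice + u_carol)                         # Bob
    return t, t1, t2, t3


def three_party_count(X, Y, Z):
    """Step III: accumulate [a], [b], [c], [d] and combine them."""
    a = b = c = d = 0
    for x, y, z in zip(X, Y, Z):
        t, t1, t2, t3 = three_party_terms(x, y, z)
        a, b, c, d = a + t, b + t3, c + t1, d + t2
    return a - b - c - d


secure_count = three_party_count([1, 0, 1, 1, 1], [1, 1, 1, 0, 1], [1, 1, 1, 1, 1])
```

The input vectors are hypothetical; the combined result equals the number of positions where all three vectors are 1.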

Theorem 1.  Protocol 1 is secure such that Alice cannot learn $Y$ and
Bob cannot learn $X$ either.

Proof.  The number $\hat{X}[i] = X[i] + R_x[i]$ is all that Bob gets.
Because of the randomness and the secrecy of $R_x[i]$, Bob cannot find
out $X[i]$. According to the protocol, Alice gets (1)
$\hat{Y}[i] = Y[i] + R_y[i]$, (2)
$Z[i] = \hat{X}[i] \cdot Y[i] + (r_y[i] - U_y[i])$, and (3) $r_x[i]$,
$R_x[i]$, where $r_x[i] + r_y[i] = R_x[i] \cdot R_y[i]$. We will show
that for any arbitrary $Y'[i]$, there exist $r'_y[i]$, $R'_y[i]$ and
$U'_y[i]$ that satisfy the above equations. Assume $Y'[i]$ is an
arbitrary number. Let $R'_y[i] = \hat{Y}[i] - Y'[i]$,
$r'_y[i] = R_x[i] \cdot R'_y[i] - r_x[i]$, and
$U'_y[i] = \hat{X}[i] \cdot Y'[i] + r'_y[i] - Z[i]$. Therefore, Alice
has (1) $\hat{Y}[i] = Y'[i] + R'_y[i]$, (2)
$Z[i] = \hat{X}[i] \cdot Y'[i] + (r'_y[i] - U'_y[i])$ and (3)
$r_x[i]$, $R_x[i]$, where $r_x[i] + r'_y[i] = R_x[i] \cdot R'_y[i]$.
Thus, from what Alice learns, there exist infinitely many possible
values for $Y[i]$. Therefore, Alice cannot know $Y$ and neither can
Bob know $X$.  □
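
The core of this argument is that Alice's view can be explained by any candidate value of Bob's input. The Python sketch below checks this for a few candidates: starting from a view produced with one value of $Y[i]$, it constructs alternative randomness under which a different value $Y'[i]$ yields exactly the same view. The function names and the concrete numbers are assumptions chosen for illustration.

```python
def alice_view(x, y, R_x, r_x, R_y, U_y):
    """Everything Alice sees for one element of Protocol 1."""
    r_y = R_x * R_y - r_x                 # Bob's commodity
    y_hat = y + R_y                       # message from Bob (step 3)
    z_msg = (x + R_x) * y + (r_y - U_y)   # message from Bob (step 4)
    return (y_hat, z_msg, r_x, R_x)


def explain_with(y_alt, view, x):
    """Find (R'_y, U'_y) under which y_alt would produce the same view."""
    y_hat, z_msg, r_x, R_x = view
    R_y_alt = y_hat - y_alt
    r_y_alt = R_x * R_y_alt - r_x
    U_y_alt = (x + R_x) * y_alt + r_y_alt - z_msg
    return R_y_alt, U_y_alt


# Hypothetical run: Alice holds x = 1, Bob's true input is y = 1.
view = alice_view(x=1, y=1, R_x=17, r_x=5, R_y=23, U_y=9)

consistent = []
for y_alt in (0, 1, 42, -7):
    R_y_alt, U_y_alt = explain_with(y_alt, view, x=1)
    consistent.append(alice_view(1, y_alt, 17, 5, R_y_alt, U_y_alt) == view)

all_candidates_consistent = all(consistent)
```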

Theorem 2.  Protocol 2 is secure such that Alice cannot learn $Y$ and
$Z$, Bob cannot learn $X$ and $Z$, and Carol cannot learn $X$ and $Y$.

Proof.  According to the protocol, Alice obtains
(1) $(Y[i] + R_y[i])$, and (2) $(Z[i] + R_z[i])$.

Bob gets $R_x[i] \cdot R_z[i]$.

Carol gets

(1) $(X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R_y[i])$ and

(2) $(X[i] \cdot Y[i] + X[i] \cdot R_y[i] + R_x[i] \cdot Y[i] + R_x[i] \cdot R'_y[i])$.

$R_x[i]$, $R_y[i]$ $(= R'_y[i] + R''_y[i])$ and $R_z[i]$ are arbitrary
random numbers. Hence, from what Alice learns, there exist infinitely
many possible values for $Y[i]$ and $Z[i]$. From what Bob learns,
there also exist infinitely many possible values for $Z[i]$. From what
Carol learns, there still exist infinitely many possible values for
$X[i]$ and $Y[i]$.

Therefore, Alice cannot learn $Y$ and $Z$, Bob cannot learn $X$ and
$Z$, and Carol cannot learn $X$ and $Y$ either.

□

5  ANALYSIS 

5.1  Security  analysis 

5.1.1  Why  Choose  A  Large  Domain 

In our protocols, all the random numbers are generated from a very
large domain (e.g., the integer domain). If the random numbers are
generated from a small domain, then one party might get some
information about the other parties' private data. For example, in
Protocol 1, if the elements of $Y$ are in the domain $[0, 300]$, and
we also know the random numbers are generated from $[0, 500]$, then if
an element of the vector $Y + R_y$ is 650, we know the original
element in the vector $Y$ is at least 150.
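
The leakage in this example is simple interval arithmetic: the private value must lie in its own domain and also within the masked value minus the mask's range. A minimal Python sketch, with a hypothetical helper name `feasible_range`, is:

```python
def feasible_range(masked, value_range, mask_range):
    """Interval of private values consistent with an observed masked sum."""
    v_lo, v_hi = value_range
    m_lo, m_hi = mask_range
    lo = max(v_lo, masked - m_hi)
    hi = min(v_hi, masked - m_lo)
    return lo, hi


# Small mask domain: observing 650 narrows Y[i] from [0, 300] to [150, 300].
small_domain = feasible_range(650, value_range=(0, 300), mask_range=(0, 500))

# Much larger mask domain: the same observation rules nothing out.
large_domain = feasible_range(650, value_range=(0, 300), mask_range=(0, 2 ** 64))
```

With the large mask domain the feasible interval is the whole value domain, which is the motivation for drawing random numbers from a very large domain.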

5.1.2  Malicious  Model  Analysis 

In this paper, our algorithm is based on the semi-honest model, where
all the parties behave honestly and cooperatively during the protocol
execution. However, in practice, one of the parties (e.g., Bob) may be
malicious in that it wants to gain true values of other parties' data
by purposely manipulating its own data before executing the protocols.
For example, in Protocol 1 Bob wants to know whether $X[i] = 1$ for
some $i$. He may make up a set of numbers with all values but the
$i$th being set to 0 (i.e., $Y[i] = 1$ and $Y[j] = 0$ for
$j \neq i$). According to Protocol 1, if Bob obtains
$\sum_{i=1}^{N} U_x[i] + \sum_{i=1}^{N} U_y[i]$, indicating the total
number of counts for both $X$ and $Y$ being 1, then Bob can know that
$X[i]$ is 0 if the above result is 0 and $X[i]$ is 1 if the above
result is 1. To deal with this problem, we may randomly select a party
to hold the frequency counts. For example, let's consider the scenario
of three parties. Without loss of generality, we assume Bob is a
malicious party. The chance that Bob gets chosen to hold the frequency
counts is $\frac{1}{3}$. We consider the following two cases. (Assume
that the samples in a sample space are equally likely.)


1.  Make  a  correct  guess  of  both  Alice’s  and  Carol’s 
values. 

If Bob is not chosen to hold the frequency counts,
he then chooses to guess randomly, and the probability
for him to make a correct guess is 1/4. In
case Bob is chosen, if the product result is 1 (with
probability 1/4), he then concludes that both
Alice and Carol have value 1; if the product result
is 0 (with probability 3/4), he has a
chance of 1/3 to make a correct guess. Therefore,
we have $\frac{2}{3} \cdot \frac{1}{4} + \frac{1}{3}\left(\frac{1}{4} + \frac{3}{4} \cdot \frac{1}{3}\right) \approx 33\%$. Note that the
chance for Bob to make a correct guess, without
his data being purposely manipulated, is 25%.

2. Make a correct guess for only one party’s (e.g.,
Alice’s) value.

If Bob is not chosen to hold the frequency counts,
the chance that his guess is correct is 1/2. In case
Bob is chosen, if the product result is 1, he then
knows Alice’s value with certainty; if the result
is 0, there are two possibilities that need to be
considered: (1) if Alice’s value is 0, then the chance
that his guess is correct is 2/3; (2) if Alice’s value is
1, then the chance that his guess is correct is 1/3.
Therefore, we have $\frac{2}{3} \cdot \frac{1}{2} + \frac{1}{3}\left(\frac{1}{4} + \frac{3}{4}\left(\frac{2}{3} \cdot \frac{2}{3} + \frac{1}{3} \cdot \frac{1}{3}\right)\right) \approx 56\%$.
However, if Bob chooses to make a random
guess, he then has a 50% chance of being correct.

It can be shown that the ratio of Bob’s chance of making a
correct guess with versus without manipulating his data in case
1 is (n+1)/n and in case 2 is approaching 1 with an
exponential rate in n (≈ $2^{-(n-1)}$), where n is the
number of parties. The probability for a malicious party
to make a correct guess about other parties’ values
decreases significantly as the number of parties increases.
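
The two cases above can be checked with exact arithmetic. The Python sketch below is an illustration under stated assumptions, not part of the protocols: every party's value is 0 or 1 with equal probability, the malicious party is chosen to hold the counts with probability 1/n, and in case 2 a product of 0 leads him to guess "0" with the same probability that the value actually is 0 (the reading that reproduces the 56% figure). The function name `guess_probabilities` is hypothetical.

```python
from fractions import Fraction

def guess_probabilities(n):
    """Exact success probabilities for one malicious party among n parties.

    Returns (case1, case2, baseline1, baseline2), where case1/case2 are the
    probabilities with manipulated data and baseline1/baseline2 are the
    probabilities of a purely random guess.
    """
    others = n - 1                        # parties whose values are targeted
    chosen = Fraction(1, n)               # chance of holding the counts
    p_one = Fraction(1, 2 ** others)      # product is 1: all others hold 1
    p_zero = 1 - p_one

    # Case 1: guess the values of all other parties at once.
    baseline1 = Fraction(1, 2 ** others)
    # A product of 0 rules out the all-ones combination only.
    guess_given_zero = Fraction(1, 2 ** others - 1)
    case1 = (1 - chosen) * baseline1 + chosen * (p_one + p_zero * guess_given_zero)

    # Case 2: guess one particular party's value (e.g., Alice's).
    baseline2 = Fraction(1, 2)
    # Given a product of 0, Alice holds 0 with probability q.
    q = Fraction(1, 2) / p_zero
    # Assumption: the guess "0" is made with probability q.
    match = q * q + (1 - q) * (1 - q)
    case2 = (1 - chosen) * baseline2 + chosen * (p_one + p_zero * match)
    return case1, case2, baseline1, baseline2

for n in (3, 4, 5):
    c1, c2, b1, b2 = guess_probabilities(n)
    print(n, c1 / b1, c2 / b2)   # ratios with/without manipulation
```

For n = 3 the function returns 1/3 and 5/9, matching the 33% and 56% figures above, and the case 1 ratio equals (n+1)/n for every n.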

5.1.3  How  to  deal  with  information  disclosure 
by  the  inference  from  the  results 

Assume the association rule DrugInjection ⇒
Hepatitis is what we get from the collaborative
association rule mining, and this rule has a 99% confidence
level (i.e., P(Hepatitis | DrugInjection) = 0.99). Now,
given a data item item_i with Alice:(DrugInjection),
Alice can figure out Bob’s data (i.e., Bob:(Hepatitis)
is in item_i) with confidence 99% (but not vice versa).
Such an inference problem exists whenever the number of items
in the association rule is small and its confidence
measure is high. To deal with information disclosure
through inference, we may require the parties to
randomize their data as in [4] with some probabilities
before conducting the association rule mining.
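
One common way to randomize binary data is randomized response: each bit is kept with probability theta and flipped otherwise, and aggregate frequencies are corrected afterwards. The Python sketch below illustrates that general idea only. It is an assumption made here for illustration and is not necessarily the scheme of [4]; the names `randomize_bits` and `estimate_true_fraction` are hypothetical.

```python
import random

def randomize_bits(bits, theta, rng=random):
    """Keep each bit with probability theta; flip it otherwise."""
    return [b if rng.random() < theta else 1 - b for b in bits]

def estimate_true_fraction(observed_fraction, theta):
    """Invert observed = theta * true + (1 - theta) * (1 - true)
    to estimate the true fraction of 1s from the randomized data."""
    if theta == 0.5:
        raise ValueError("theta = 0.5 destroys all information")
    return (observed_fraction - (1 - theta)) / (2 * theta - 1)

# Illustrative data: 30% of the records hold a 1.
rng = random.Random(0)
original = [1] * 300 + [0] * 700
noisy = randomize_bits(original, theta=0.8, rng=rng)
observed = sum(noisy) / len(noisy)
print(estimate_true_fraction(observed, theta=0.8))  # close to 0.3 on average
```

Individual records become deniable, while frequency counts, and hence support and confidence, can still be estimated in aggregate.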

5.1.4 How to deal with the repeated use of the protocol

A malicious party (e.g., Bob) may ask to run the
protocol multiple times with a different set of values each
time by manipulating his Uy[.] values. If the other parties
respond with honest answers, then this malicious party
may have a chance to obtain the actual values of the other
parties. To avoid this type of disclosure, constraints must
be imposed on the number of repetitions.
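
The effect of such a limit can be sketched in Python. This is a simplified two-party illustration, not the protocol itself: `dot_product_oracle` is a hypothetical stand-in that returns only the final count, no cryptography is modelled, and the malicious party is assumed to see every result.

```python
def dot_product_oracle(x, y):
    """Stand-in for one protocol run: returns only the number of positions
    where both binary vectors hold 1."""
    return sum(a * b for a, b in zip(x, y))

def probe_attack(x_private, max_runs):
    """Recover as many entries of x_private as the run limit allows by
    submitting one one-hot vector per protocol run."""
    n = len(x_private)
    recovered = {}
    for i in range(min(n, max_runs)):
        # Manipulated input: Y[i] = 1 and Y[j] = 0 for j != i.
        y = [1 if j == i else 0 for j in range(n)]
        recovered[i] = dot_product_oracle(x_private, y)
    return recovered

alice = [1, 0, 1, 1, 0]
print(probe_attack(alice, max_runs=5))  # {0: 1, 1: 0, 2: 1, 3: 1, 4: 0}
print(probe_attack(alice, max_runs=2))  # {0: 1, 1: 0}
```

With as many runs as there are entries, the whole vector is recovered; capping the number of runs bounds how many entries can be learned this way.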

5.2  Communication  Analysis 

There are three sources which contribute to the total
bit-wise communication cost for the above protocols:
(1) the number of rounds of communication to
compute the number product for a single value (denoted
by NumRod); (2) the maximum number of bits for the
values involved in the protocols (denoted by M); (3)
the number of times (N) that the protocols are applied.
The total cost can be expressed as NumRod * M * N,
where NumRod and M are constants for each
protocol. Therefore, the communication cost is O(N). N
becomes large when the number of parties is large, since
N increases exponentially with the number of parties.
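
As a quick illustration of the cost formula, the Python sketch below evaluates NumRod * M * N for a few values of N. The function name and the sample values (2 rounds, 32-bit values) are hypothetical and chosen only to show the linear dependence on N.

```python
def communication_cost_bits(num_rounds, max_bits, num_applications):
    """Total bit-wise cost NumRod * M * N, with NumRod = num_rounds,
    M = max_bits and N = num_applications."""
    return num_rounds * max_bits * num_applications

# With NumRod and M fixed, the cost grows linearly in N.
for n_applications in (10, 100, 1000):
    print(n_applications, communication_cost_bits(2, 32, n_applications))
# 10 640
# 100 6400
# 1000 64000
```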

6  CONCLUDING  REMARKS 

In this paper, we consider the problem of
privacy-preserving collaborative data mining with inputs of
binary data sets. In particular, we study how multiple
parties can jointly conduct association rule mining on
private data. We provided an efficient association rule
mining procedure to carry out such a computation. To
securely collect the necessary statistical
measures from the data of multiple parties, we have developed
a secure protocol, namely the number product
protocol, for multiple parties to jointly conduct their desired
computations. We also discussed the malicious model
and approached it by distributing the measure of
frequency counts to different parties, and suggested the
use of the randomization method to reduce data
disclosure through inference.

In  our  future  work,  we  will  extend  our  method  to 
deal  with  non-binary  data  sets.  We  will  also  apply  our 
technique  to  other  data  mining  computations,  such  as 
privacy-preserving clustering.